# Metadata

Replication instructions for: 

- Brian Libgober, Connor T. Jerzak. Linking Datasets on Organizations Using Half-a-Billion Open-Collaborated Records. To appear in *Political Science Research and Methods*, 2024.

See `ms.pdf` for a PDF of the manuscript text containing the final figures. 

# Repository Introduction
 
The repository contains four folders. 

- `Analysis` contains all analysis code (see Code Introduction for more information). 
- `Data Inputs` contains all data inputs. 
- `DataOutputs` contains all data outputs from the code in `Analysis`. 
- `Figures` contains all figures as presented in the final version of the manuscript, `ms.pdf`.
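
As a quick sanity check after downloading the repository, the folder layout above can be verified from a terminal. This is only a minimal sketch: `check_repo_layout` is a hypothetical helper (not part of the repository), and the folder names are taken verbatim from the list above.

```shell
# Hypothetical helper: report which of the four top-level folders exist.
# Folder names are copied from the list above; note the quoting for "Data Inputs".
check_repo_layout() {
  root="${1:-.}"   # repository root; defaults to the current directory
  missing=0
  for dir in "Analysis" "Data Inputs" "DataOutputs" "Figures"; do
    if [ -d "$root/$dir" ]; then
      echo "found:   $dir"
    else
      echo "missing: $dir"
      missing=1
    fi
  done
  return "$missing"   # 0 if all folders are present, 1 otherwise
}
```

Calling `check_repo_layout /path/to/repository` prints one line per folder and returns a non-zero status if any folder is missing.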

# Code Introduction 

The main analysis script is `LinkOrgs_EmpiricalExamples_Master.R`. It calls the other scripts in the following structure:

- `LinkOrgs_EmpiricalExamples_Master.R` (manages the analysis pipeline)
  - `LinkOrgs_EmpiricalExamples_Run.R` (manages the application of a specific method to a specific dataset)
    - `LinkOrgs_ProcessExampleData.R` (manages generation of the example data)
  - `LinkOrgs_EmpiricalExamples_Evaluate.R` (manages evaluation of the analysis results)
    - `LinkOrgs_AccuracyPlots.R` (manages plotting of the analysis results)

In `LinkOrgs_EmpiricalExamples_Master.R`, the parameter `JustEvaluateResults` controls how much of the pipeline runs. If set to `FALSE`, the full analysis is run (~24-48 hours, depending on hardware); if set to `TRUE`, only the figures are re-created from the existing analysis results (~1-2 minutes). The default is `TRUE`. `LinkOrgs_EmpiricalExamples_Master.R` generates Figures 7-9 and Table 3. Note that Table 3 will show different results on different hardware, depending on the number of CPU cores and GPU availability. The full analysis in `LinkOrgs_EmpiricalExamples_Master.R` also creates some figures that are not used in the main analysis but are useful for diagnostic purposes.
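
Because the two modes differ by a day or more of compute time, it may be worth confirming how the flag is currently set before launching the script. The sketch below simply searches the script for the parameter name; `show_flag` is a hypothetical helper, and it assumes only that the parameter name appears literally in the file (it makes no assumption about how the assignment is written).

```shell
# Hypothetical helper: print every line (with line number) that mentions a flag.
# Usage: show_flag FLAG [SCRIPT]; SCRIPT defaults to the master script.
show_flag() {
  flag="$1"
  script="${2:-LinkOrgs_EmpiricalExamples_Master.R}"
  grep -n "$flag" "$script"
}

# Example (run from the folder that contains the master script):
#   show_flag JustEvaluateResults
```

The same helper can be pointed at any other parameter name or script by changing its arguments.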

A secondary analysis script is `JaxTransformer_Inference.R`. This script can be run in `R`'s interactive mode. It generates Figure 3 and Figure 4.

# If re-running the full evaluation suite (`JustEvaluateResults = FALSE`)

If `JustEvaluateResults` is set to `FALSE`, the following steps must also be completed.

1. Install `LinkOrgs-software` and build the computational backend. 
```
# install package (requires the devtools package)
devtools::install_github('cjerzak/LinkOrgs-software/LinkOrgs')

# build LinkOrgs backend (do this only once after installing the latest version of LinkOrgs)
# tryMetal = TRUE attempts to install the jax-metal backend on Mac; if this fails, set tryMetal = FALSE
LinkOrgs::BuildBackend(conda_env = "LinkOrgsEnv", tryMetal = TRUE)
```
  
2. Install `DeezyMatch` (used as a baseline). See `https://github.com/Living-with-machines/DeezyMatch?tab=readme-ov-file`. In particular, follow these steps with your working directory set to the replication repository: 
```
# Create and activate a conda environment for DeezyMatch:
conda create -n py39deezy python=3.9
conda activate py39deezy

# Clone DeezyMatch to your Downloads folder:
git clone https://github.com/Living-with-machines/DeezyMatch.git

# Install DeezyMatch dependencies:
cd /path/to/my/DeezyMatch
pip install -r requirements.txt

# Then install DeezyMatch:
python setup.py install

# Then install PyTorch v1.9.0:
python3 -m pip install torch==1.9.0

# Then install jax:
python3 -m pip install jax jaxlib
```

3. Finally, if running the full evaluation suite with `JustEvaluateResults` set to `FALSE`, note that the analysis cannot be run as a normal `for` loop, because the `DeezyMatch` and `LinkOrgs` approaches have incompatible dependencies and therefore use different `conda` environments. We therefore run the loop using GNU `parallel`, which can be installed with `brew` on Mac (`brew install parallel`) or with your distribution's package manager on Linux (e.g., `sudo apt install parallel`). 
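
Before moving on, it can help to confirm that `parallel` is actually on your `PATH`. The following is a minimal sketch, not part of the repository: `check_parallel` is a hypothetical helper, and the install hints it prints are suggestions only (the function does not install anything).

```shell
# Hypothetical helper: report whether `parallel` is available and, if not,
# print an install hint based on the operating system.
check_parallel() {
  if command -v parallel >/dev/null 2>&1; then
    echo "parallel found at: $(command -v parallel)"
  else
    case "$(uname -s)" in
      Darwin) echo "parallel not found; try: brew install parallel" ;;
      Linux)  echo "parallel not found; try your package manager, e.g.: sudo apt-get install parallel" ;;
      *)      echo "parallel not found; see the GNU parallel documentation" ;;
    esac
    return 1
  fi
}
```

Running `check_parallel` returns status 0 when the tool is available and 1 otherwise, so it can also be used as a guard in a setup script.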

After installing `parallel`, run the full evaluation suite as follows. First, change the line `cd ~/Dropbox/Directory/Analysis` in the `RunParallel_LinkOrgsApps.sh` file to the corresponding location in your copy of the replication repository. Next, open a terminal and run `sh RunParallel_LinkOrgsApps.sh`. This way, the analysis loop is run with new R processes, which the `parallel` manager terminates upon task completion. 
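
The path change can be made in a text editor or, as a sketch, from the command line. `set_analysis_dir` below is a hypothetical helper, and the example path is a placeholder; the only thing assumed about `RunParallel_LinkOrgsApps.sh` is that it contains the `cd ~/Dropbox/Directory/Analysis` line mentioned above.

```shell
# Hypothetical helper: rewrite the hard-coded `cd` line in a script.
# Usage: set_analysis_dir SCRIPT NEW_DIR
# Caveat: NEW_DIR must not contain the characters |, & or \ (they are special to sed here).
set_analysis_dir() {
  script="$1"
  new_dir="$2"
  # -i.bak edits in place and keeps a backup copy (works with GNU and BSD sed).
  sed -i.bak "s|^[[:space:]]*cd ~/Dropbox/Directory/Analysis.*|cd \"$new_dir\"|" "$script"
}

# Example (placeholder path; substitute your own):
#   set_analysis_dir RunParallel_LinkOrgsApps.sh "/path/to/replication/Analysis"
```

The original file is preserved as `RunParallel_LinkOrgsApps.sh.bak`, so the change is easy to undo.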

If running the full analysis, we recommend first setting `JustRunTestAnalysis` to `TRUE` to help with debugging (`JustRunTestAnalysis` runs the analysis suite on the smallest data sample). 

# Other notes

The original analysis was run on a machine with a 32-core Intel i9 CPU (32 GB RAM) running Linux (Pop!_OS 22.04 LTS), R (v4.4.0), Python (v3.12), and conda (v24.5.0).

The analysis used Python libraries JAX v0.4.26, Equinox v0.11.4, Optax v0.2.2, tensorflow v2.16, jmp v0.0.4, `tensorflow_probability` v0.23, jaxlib v0.4.6, `ml_dtypes` v0.4.0, and numpy v1.26.4.

The analysis used external R libraries data.table v1.15.4, fastmatch v1.1-4, plyr v1.8.9, dplyr v1.1.4, stringdist v0.9.10, stringr v1.5.1, foreach v1.5.2, doParallel v1.0.17, lmtest v0.9-40, sandwich v3.1-0, sfsmisc v1.1-18, and stargazer v5.2.3. 
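
To compare your own environment against the versions listed above, something like the following sketch may be useful. `report_versions` is a hypothetical helper; it assumes only that each tool, if installed, responds to `--version`.

```shell
# Hypothetical helper: print the first line of `--version` output for
# R, python3, and conda, or "not found" if a tool is not on the PATH.
report_versions() {
  for tool in R python3 conda; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%s: %s\n' "$tool" "$("$tool" --version 2>&1 | head -n 1)"
    else
      printf '%s: not found\n' "$tool"
    fi
  done
}
```

Running `report_versions` prints one line per tool, which can be saved alongside your results to document the environment used.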

*Note to PSRM Replication Team:* If any additional information is needed for these replication files, don't hesitate to reach out to `connor.jerzak@austin.utexas.edu`; we can set up a Zoom meeting to discuss any aspect of setting up the computational environments. 

